πŸ•ΈοΈ Ada Research Browser


Architecture Decision Records

This document captures key architectural decisions for the Secure Runtime Environment (SRE) platform. Each decision follows the ADR format: context, options considered, decision, and consequences.


ADR-001: Kubernetes Distribution - RKE2

Date: 2025-02-19 Status: Accepted NIST Controls: CM-2, CM-6, SC-3, SI-7

Context

The platform requires a Kubernetes distribution that meets DISA STIG requirements, supports FIPS 140-2 cryptographic modules, runs on air-gapped networks, and has a clear path to government accreditation.

Options Considered

  1. RKE2 - Rancher's government-focused distribution with built-in FIPS support, CIS hardening profile, and DISA STIG compliance out of the box.
  2. Kubeadm + manual hardening - Upstream Kubernetes with custom STIG application. Maximum flexibility but significant operational overhead.
  3. OpenShift - Red Hat's enterprise distribution. Strong government presence but requires RHEL subscription, is opinionated about tooling, and has a much larger footprint.
  4. K3s - Lightweight Rancher distribution. Excellent for edge but lacks FIPS mode and STIG compliance focus.

Decision

RKE2. It provides FIPS 140-2 compliance via the GoBoring Go compiler, ships with a CIS 1.23 hardening profile (enabled through the profile configuration option), has published DISA STIG benchmarks, includes an embedded etcd (no external dependency), and runs on Rocky Linux 9, which itself has a published STIG.

Consequences


ADR-002: GitOps Engine - Flux CD

Date: 2025-02-19 Status: Accepted NIST Controls: CM-2, CM-3, CM-5, SA-10, SI-7

Context

The platform needs a GitOps engine to continuously reconcile cluster state against a Git repository, providing auditability, drift detection, and declarative infrastructure management.

Options Considered

  1. Flux CD - CNCF graduated. Pull-based reconciliation, native Kubernetes CRDs (HelmRelease, Kustomization), multi-tenancy support.
  2. Argo CD - CNCF graduated. Web UI, application-centric model, SSO integration, RBAC on applications.
  3. Fleet (Rancher) - Bundled with Rancher. Simplifies multi-cluster but tightly coupled to Rancher ecosystem.

Decision

Flux CD. Its CRD-native approach aligns with our "everything is a Kubernetes resource" philosophy. HelmRelease CRDs provide declarative Helm management with built-in health checks and remediation. Flux's lighter footprint and lack of a stateful UI component reduce the attack surface. Multi-tenancy is handled at the Kustomization level with service account impersonation.
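
The tenancy model described above can be sketched as a Flux Kustomization that reconciles under a tenant-scoped service account. The snippet builds the manifest as a Python dictionary and prints JSON, which kubectl apply -f accepts. The tenant name, repository path, Git source name ("platform-repo"), and service account naming scheme are hypothetical placeholders, not values taken from this platform.

```python
import json


def tenant_kustomization(tenant: str, path: str) -> dict:
    """Build a Flux Kustomization that reconciles as a tenant service account.

    Assumptions (illustrative only): a GitRepository named "platform-repo"
    exists in the tenant namespace, and a service account named
    "<tenant>-reconciler" holds the RBAC the tenant is allowed to use.
    """
    return {
        "apiVersion": "kustomize.toolkit.fluxcd.io/v1",
        "kind": "Kustomization",
        "metadata": {"name": f"{tenant}-apps", "namespace": tenant},
        "spec": {
            "interval": "10m",
            "path": path,
            # Remove cluster objects that were deleted from Git.
            "prune": True,
            "sourceRef": {"kind": "GitRepository", "name": "platform-repo"},
            # Flux impersonates this account when applying the manifests.
            "serviceAccountName": f"{tenant}-reconciler",
            "targetNamespace": tenant,
        },
    }


if __name__ == "__main__":
    manifest = tenant_kustomization("team-alpha", "./tenants/team-alpha")
    print(json.dumps(manifest, indent=2))
```

Because reconciliation runs with the tenant's service account, a tenant repository cannot create resources beyond what that account's RBAC permits.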

Consequences


ADR-003: Policy Engine - Kyverno

Date: 2025-02-19 Status: Accepted NIST Controls: AC-3, AC-6, CM-6, CM-7, SI-7

Context

The platform needs admission control to enforce security policies, validate resource configurations, mutate resources for compliance, and verify container image signatures.

Options Considered

  1. Kyverno - Kubernetes-native, YAML-based policies, built-in image verification, mutation support, policy reporting CRDs.
  2. OPA Gatekeeper - Rego language for policies, ConstraintTemplate CRDs, mature ecosystem.
  3. Kubewarden - Wasm-based policies, language-agnostic, newer project.

Decision

Kyverno. YAML-based policies are more accessible to the platform team than Rego (OPA) and align with our GitOps approach. Built-in Cosign image verification eliminates the need for a separate admission webhook. ClusterPolicy and Policy CRDs provide flexible scoping. PolicyReport CRDs integrate directly with monitoring for compliance dashboards.
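
As a rough illustration of the built-in image verification mentioned above, the sketch below assembles a Kyverno ClusterPolicy with a verifyImages rule and prints it as JSON. The policy name, registry pattern, and public key are hypothetical placeholders. The policy-level validationFailureAction field is assumed here; newer Kyverno releases also allow the failure action to be set per rule, so check the installed version.

```python
import json


def require_signed_images(image_glob: str, cosign_public_key: str) -> dict:
    """Build a ClusterPolicy that rejects Pods whose images lack a valid signature.

    image_glob and cosign_public_key are supplied by the caller; the values
    used in the example below are placeholders, not real platform settings.
    """
    return {
        "apiVersion": "kyverno.io/v1",
        "kind": "ClusterPolicy",
        "metadata": {"name": "verify-image-signatures"},
        "spec": {
            # Enforce blocks admission; Audit would only report violations.
            "validationFailureAction": "Enforce",
            "webhookTimeoutSeconds": 30,
            "rules": [
                {
                    "name": "require-cosign-signature",
                    "match": {"any": [{"resources": {"kinds": ["Pod"]}}]},
                    "verifyImages": [
                        {
                            "imageReferences": [image_glob],
                            "attestors": [
                                {"entries": [{"keys": {"publicKeys": cosign_public_key}}]}
                            ],
                        }
                    ],
                }
            ],
        },
    }


if __name__ == "__main__":
    placeholder_key = "-----BEGIN PUBLIC KEY-----\n<key material>\n-----END PUBLIC KEY-----"
    policy = require_signed_images("harbor.example.internal/*", placeholder_key)
    print(json.dumps(policy, indent=2))
```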

Consequences


ADR-004: Secrets Management - OpenBao + External Secrets Operator

Date: 2025-02-19 Status: Accepted NIST Controls: IA-5, SC-12, SC-13, SC-28

Context

The platform needs centralized secrets management with dynamic secret generation, automatic rotation, encryption at rest, and Kubernetes-native secret delivery.

Options Considered

  1. OpenBao + ESO - OpenBao (open-source Vault fork) for secrets storage + External Secrets Operator for Kubernetes delivery.
  2. HashiCorp Vault + VSO - HashiCorp Vault with Vault Secrets Operator. Industry standard but BSL-licensed since 2023.
  3. Sealed Secrets - Bitnami's encryption-based approach. Simple but no dynamic secrets, no rotation.
  4. AWS Secrets Manager + ESO - Cloud-native but creates cloud provider lock-in.

Decision

OpenBao + ESO. OpenBao is the community fork of HashiCorp Vault created after the BSL license change, maintaining API compatibility while being truly open-source (MPL-2.0). ESO syncs secrets from OpenBao into native Kubernetes Secrets, which applications consume without code changes. This avoids both vendor lock-in (HashiCorp BSL) and cloud lock-in (AWS-only).
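
The sync path described above might look like the following ExternalSecret, built as a Python dictionary and printed as JSON. It assumes a ClusterSecretStore named "openbao" has already been configured to reach the OpenBao server; that store name, the secret path, and the property names are hypothetical. The apiVersion shown is an assumption that depends on the installed ESO release.

```python
import json


def external_secret(name: str, namespace: str, store: str,
                    remote_key: str, properties: list[str]) -> dict:
    """Build an ExternalSecret that copies selected properties into a Secret.

    Each property of the remote secret becomes a key of the same name in the
    resulting Kubernetes Secret. All names passed in are caller-supplied.
    """
    return {
        # Assumed API version; newer ESO releases may serve a different one.
        "apiVersion": "external-secrets.io/v1beta1",
        "kind": "ExternalSecret",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # How often ESO re-reads the backend and updates the Secret.
            "refreshInterval": "1h",
            "secretStoreRef": {"kind": "ClusterSecretStore", "name": store},
            "target": {"name": name},
            "data": [
                {"secretKey": prop, "remoteRef": {"key": remote_key, "property": prop}}
                for prop in properties
            ],
        },
    }


if __name__ == "__main__":
    manifest = external_secret(
        name="orders-db",
        namespace="team-alpha",
        store="openbao",
        remote_key="team-alpha/orders-db",
        properties=["username", "password"],
    )
    print(json.dumps(manifest, indent=2))
```

The application then mounts or references the Secret named "orders-db" as it would any other Kubernetes Secret.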

Consequences


ADR-005: Service Mesh - Istio

Date: 2025-02-19 Status: Accepted NIST Controls: AC-4, SC-7, SC-8, SC-13, AU-2

Context

The platform needs encrypted service-to-service communication (mTLS), traffic management, observability (distributed tracing), and fine-grained authorization policies.

Options Considered

  1. Istio - CNCF graduated. Full-featured mesh with mTLS, traffic management, observability, authorization. Largest community and ecosystem.
  2. Linkerd - CNCF graduated. Lighter weight, Rust data plane, simpler operational model. No built-in authorization policies.
  3. Cilium Service Mesh - eBPF-based, no sidecar, integrated with Cilium CNI. Newer mesh implementation.

Decision

Istio. It provides STRICT mTLS enforcement (satisfying SC-8 encryption in transit), AuthorizationPolicy CRDs for fine-grained access control (AC-4), and RequestAuthentication for JWT validation. Istio's telemetry integration provides automatic metrics, traces, and access logs for all mesh traffic (AU-2). The large ecosystem means better government adoption references and more available expertise.
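
A minimal sketch of the two resources named above: a mesh-wide PeerAuthentication in STRICT mode and an AuthorizationPolicy that allows one caller identity to issue GET requests to a workload. It assumes Istio's root namespace is the default istio-system and that the trust domain is cluster.local; the workload label, namespaces, and service account names are hypothetical.

```python
import json


def strict_mtls(root_namespace: str = "istio-system") -> dict:
    """Mesh-wide STRICT mTLS: a policy named "default" in the root namespace."""
    return {
        "apiVersion": "security.istio.io/v1beta1",
        "kind": "PeerAuthentication",
        "metadata": {"name": "default", "namespace": root_namespace},
        "spec": {"mtls": {"mode": "STRICT"}},
    }


def allow_get_from(namespace: str, app: str, caller_ns: str, caller_sa: str) -> dict:
    """Allow only GET requests from one service account to the labelled workload."""
    # SPIFFE-style principal, assuming the default trust domain "cluster.local".
    principal = f"cluster.local/ns/{caller_ns}/sa/{caller_sa}"
    return {
        "apiVersion": "security.istio.io/v1beta1",
        "kind": "AuthorizationPolicy",
        "metadata": {"name": f"{app}-allow-get", "namespace": namespace},
        "spec": {
            "selector": {"matchLabels": {"app": app}},
            "action": "ALLOW",
            "rules": [
                {
                    "from": [{"source": {"principals": [principal]}}],
                    "to": [{"operation": {"methods": ["GET"]}}],
                }
            ],
        },
    }


if __name__ == "__main__":
    for manifest in (
        strict_mtls(),
        allow_get_from("team-alpha", "orders", "team-alpha", "frontend"),
    ):
        print(json.dumps(manifest, indent=2))
```

Once an ALLOW policy selects a workload, requests that match none of its rules are denied, so the second resource also acts as a default-deny for other callers of that workload.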

Consequences


ADR-006: Container Registry - Harbor

Date: 2025-02-19 Status: Accepted NIST Controls: CM-2, SI-3, SI-7, SA-11, RA-5

Context

The platform needs a container registry that provides image storage, vulnerability scanning, image signing/verification, replication from upstream registries, and RBAC for multi-tenant access.

Options Considered

  1. Harbor - CNCF graduated. Full-featured registry with built-in Trivy scanning, Cosign/Notation signing, replication, RBAC, robot accounts.
  2. Distribution (Docker Registry) - Minimal, open-source. No scanning, no signing, no RBAC beyond basic auth.
  3. Quay - Red Hat's registry. Feature-rich but smaller community outside Red Hat ecosystem.

Decision

Harbor. It provides integrated Trivy vulnerability scanning (RA-5), Cosign signature verification (SI-7), replication policies for pulling from upstream registries into the air-gapped environment, project-based RBAC for multi-tenant isolation, and robot accounts for CI/CD automation.

Consequences


ADR-007: Runtime Security - NeuVector

Date: 2025-02-19 Status: Accepted NIST Controls: SI-3, SI-4, IR-4, IR-5, SC-7

Context

The platform needs runtime security monitoring that detects anomalous container behavior, enforces network microsegmentation at Layer 7, provides DLP/WAF capabilities, and integrates with the incident response workflow.

Options Considered

  1. NeuVector - SUSE open-source. Full lifecycle container security: vulnerability scanning, runtime protection, network DLP/WAF, behavioral learning.
  2. Falco - CNCF graduated. Runtime threat detection via system call monitoring. Alert-only (no enforcement).
  3. KubeArmor - CNCF sandbox. eBPF/LSM-based enforcement. Newer with smaller community.

Decision

NeuVector. It provides both detection and enforcement (Falco is detect-only). Network microsegmentation with DLP/WAF satisfies SI-4 and SC-7. Behavioral learning mode creates baselines that automatically become enforcement rules. Syslog integration feeds alerts to Alloy/Loki for centralized incident response.

Consequences


ADR-008: Infrastructure as Code - OpenTofu

Date: 2025-02-19 Status: Accepted NIST Controls: CM-2, CM-3, SA-10

Context

The platform infrastructure (VPCs, compute, load balancers, DNS, storage) needs to be managed declaratively with state tracking, planning, and drift detection.

Options Considered

  1. OpenTofu - Open-source fork of Terraform (MPL-2.0), maintaining full HCL compatibility and provider ecosystem.
  2. Terraform - HashiCorp's original IaC tool. BSL-licensed since 2023.
  3. Pulumi - Multi-language IaC. TypeScript/Python/Go support. Smaller ecosystem for government infrastructure.
  4. AWS CDK / Azure Bicep - Cloud-specific IaC. Locks to a single provider.

Decision

OpenTofu. It's the community fork of Terraform created after the BSL license change, maintaining full compatibility with the HCL language, provider ecosystem, and existing Terraform modules. This aligns with our open-source-only policy and avoids BSL license concerns for government deployment.

Consequences


ADR-009: Operating System - Rocky Linux 9

Date: 2025-02-19 Status: Accepted NIST Controls: CM-6, SI-2

Context

The platform needs a base operating system that has a published DISA STIG, supports FIPS 140-2/3 mode, provides long-term support, and is freely available without subscription.

Options Considered

  1. Rocky Linux 9 - CentOS successor, RHEL binary-compatible, free, has DISA STIG.
  2. RHEL 9 - Red Hat Enterprise Linux. Gold standard for government but requires subscription.
  3. Ubuntu 22.04 - Canonical LTS. Popular but DISA STIG trails RHEL-based distros.
  4. Amazon Linux 2023 - AWS-optimized. No DISA STIG, AWS lock-in.

Decision

Rocky Linux 9. It's binary-compatible with RHEL 9 (same STIG applies), freely available without subscription, supports FIPS mode, and has a 10-year support lifecycle. The STIG is published and actively maintained by DISA.

Consequences


How to Add New ADRs

Use the /generate-adr slash command in Claude Code:

/generate-adr Should we add Backstage as a developer portal

This will research the decision, create a new ADR entry, and maintain this index.

ADR Numbering

Status Values